Abstract

Multimodal interfaces are becoming increasingly important with the advent of mobile devices, accessibility
considerations, and novel software technologies that combine diverse interaction media. This article investigates
systems support for web browsing in a multimodal interface. Specifically, we outline the design and implementation
of a software framework that integrates hyperlink and speech modes of interaction. Instead of viewing speech
as merely an alternative interaction medium, the framework uses it to support out-of-turn interaction, providing a
flexibility of information access not possible with hyperlinks alone. This approach enables the creation of websites
that adapt to the needs of users, yet permits the designer fine-grained control over what interactions to support.
Design methodology, implementation details, and two case studies are presented.

Keywords: Multimodal interfaces, web interaction on mobile devices, dialog processing engines, mixed-initiative
interaction.
1 Introduction



Computing power today is increasingly moving away from the desktop computer to mobile computing devices such 
as PDAs, tablet PCs, and 3G phones. While posing capacity limitations (e.g., screen real estate, memory), such 
devices also present possibilities for multimodal interaction via gestures, speech, and handwriting recognition. 
An area that is witnessing tremendous growth in multimodal interaction is web browsing on mobile devices.
Technologies such as SALT (Speech Application Language Tags) and X+V (XHTML plus Voice) are ushering in
the speech-enabled web - documents that can talk and listen rather than passively display content. The maturing
of commercial speech recognition engines [15] has been a key factor in the emergence of this niche segment of
multimodal browsing. 

Speech as a mode of web interaction has become important for primarily two reasons. First, speech permits
natural ways to perform certain types of tasks and helps compensate for deficiencies in traditional hyperlink access
(which can get cumbersome on small form-factor devices). Second, and more importantly, speech-enabled websites
help improve accessibility for the more than 40 million visually impaired people in the world today. As a result,
using speech leads to the possibility of a conversational user interface Q that combines the expressive freedom of
voice with the information bandwidth of a traditional browser.

What exactly would we use a speech-enabled web interface for? The common use of speech on a website is to 
support navigation of existing site structure via voice [4], in other words as an alternative interaction medium. We
posit that this is a rather limited viewpoint and that speech can actually be used to support new functionality at a
website. In particular, we show how a multimodal web interface can support a flexibility of information access not
possible with hyperlinks alone. 






Motivating Example 

Consider the following dialogs between an information seeker (Sallie) and an automated political information system. 
Dialog 1 

1 System: Welcome. Are you looking for a Senator or a Representative?

2 Sallie: Senator.

3 System: ...

4 Sallie: ...

5 System: ...

6 Sallie: ...

7 System: ...
(conversation continues)

Dialog 2 

1 System: Welcome. Are you looking for a Senator or a Representative? 

2 Sallie: Senator. 

3 System: Democrat or Republican or an Independent? 

4 Sallie: Not sure, but represents the state of Indiana. 

5 System: Well, then it is either a Democrat or a Republican, there are no Independents from Indiana.

6 Sallie: I see. Who is the Republican Senator? 

7 System: That would be Richard G. Lugar. First elected in 1976, Lugar ... 
(conversation continues) 

It is helpful to contrast these dialogs from a conversational initiative standpoint. In the first dialog, Sallie responds to
the questions in the order they are posed by the system. Such a dialog is called a fixed-initiative dialog, as the initiative
resides with the system at all times. The second dialog is system-initiated until Line 4, where Sallie's input is not
responsive to the question posed and instead provides information that was not solicited. We say that Sallie has taken
the initiative of conversation from the system. Nevertheless, the conversation is not stalled: the system registers that
Sallie answered a different question than was asked, and refocuses the dialog in Line 5 on the issue of party (this time,
narrowing down the available options from three to two). Sallie now responds to the initiative and the dialog
progresses to complete the specification of a political official. Such a conversation, where the two parties exchange
initiative, is called a mixed-initiative interaction [2].

What would be required to support such a flexibility of interaction at a website? It is clear that system-initiated 
modes of interaction are easiest to support and are the most prevalent in web browsing today. For instance, a webpage 
displaying a choice of hyperlinks presents such a view, so that clicking on a hyperlink corresponds to Sallie respond- 
ing to the initiative. The reader can verify that the first dialog above can be supported by a three-level tree-structured 
HTML site presenting options for branch of congress, party, and state. But how can we support the second dialog, 
allowing Sallie to take the initiative at a website? This is where speech comes in. If Sallie can talk into the browser,
she can provide unsolicited information using voice when she is unable to make a choice among the presented
hyperlinks. In addition, if the system can process such an out-of-turn input, it can continue the progression of dialog
and tailor future webpages so that they accurately reflect the information gathered over the course of the interaction.
We have designed many such multimodal web interfaces; one example is shown in Fig. 1.

It is important to re-iterate that speech input is used here to provide a certain out-of-turn interaction capability
at a website. In other words, Sallie is not merely using voice to answer the posed question (although she can do
that too), but using it to specify unsolicited information. In the absence of such an out-of-turn facility, the website
designer would have to anticipate various user needs and support all possible interaction sequences directly in the
HTML site structure (e.g., browse by branch-party-state order, browse by branch-state-party order, etc.), or provide a
search facility as a method for pruning web pages. The first solution is inelegant due to the mushrooming of choices,
and the second is not desirable either since search facilities usually terminate the dialog and return a flat list of results.
Out-of-turn interaction via speech does not clutter the interface and provides a smooth continuation of the dialog.

Figure 1: A mixed-initiative interaction with a multimodal web interface. (a) Sallie clicks on 'Senator'.
(b) Sallie says 'Indiana'. (c) Sallie clicks on 'Republican'. (d) The dialog is complete.

We must point out that we are not supporting free-form input of all kinds, only input pertaining to specification 
aspects that are not yet solicited by the system. This can be viewed as akin to "looking under the hood, and saying a
hyperlink label that is deeper below."

2 System Design 

Supporting such a natural mode of interaction in a web interface is not an easy undertaking. While technologies such
as SALT and X+V enable the integration of speech into browsers, they either operate at a lower level of specification
than the applications considered here, hence significantly increasing programming effort, or are otherwise
limited in their expressive power for creating and managing dialogs [10]. A multimodal web application must build
on these technologies to provide flexible dialog capabilities.

Several considerations emerge in thinking about a software system design for multimodal web interaction. First, 
it is important to have uniform processing of hyperlink and voice interaction and, when voice is used, to introduce 
minimal overhead in handling responsive versus unsolicited input. Observe that hyperlink access can only be used to 
respond to the initiative whereas voice input can be used both for responding and for taking the initiative. Further- 
more, a user may combine these modes of initiative in a given utterance - e.g., if the user speaks "Republican Senator 
from Minnesota" at the outset, he is responding to the current solicitation as well as providing two unsolicited pieces 
of information. Uniform processing of input modalities irrespective of medium (hyperlink or voice) or initiative 
(responsive or unsolicited) is thus important to support a seamless multimodal interface. Second, it is beneficial to 
have a representation of the dialog at all times, in order to determine how the user's input affects remaining dialog 
options. For instance, in Line 5 of Dialog 2 above, Independents are removed as a possible party choice; in addition
to pruning the hyperlink structure (shown in Fig. 1(c)), we must dynamically reconfigure the speech recognizer to
only await the remaining legal utterances. Third, it must be possible for the site designer to exert fine-grained control
over what types of out-of-turn interactions are to be supported. And finally, it is desirable to be able to automatically
re-engineer existing websites for multimodal out-of-turn interaction, without manual configuration.

2.1 Dialog Representation 

We have designed a framework taking into account all these considerations [13]. It is based on staging
transformations [11] - an approach that represents dialogs by programs and uses program transformations to simplify
them based on user input. As an example, the first program in Fig. 2 depicts a representation of the dialog from
Section 1 in a programmatic notation. The tree-structured nature of the website is represented as a nested program of
conditionals, where each variable corresponds to a hyperlink that is present in the site. We can think of this program
as being derived from a depth-first traversal of the site. The sequence of transformations in Fig. 2 depicts what we
want to happen for Dialog 1 of Section 1; the sequence of transformations in Fig. 3 depicts what we want to happen
for Dialog 2.

The first sequence of transformations corresponds to simply interpreting the program in the order in which it is
written. Thus, when Sallie clicks on 'Senator' she is specifying the values for the top level of nested conditionals
('Senator' is set to one, and 'Representative' is set to zero). This leads to a simplified program that now solicits
for choice of party. The sequence of Fig. 3, on the other hand, corresponds to 'jumping ahead' to nested segments
and simplifying out inner portions of the program before outer portions are even specified. This transformation is
well known as partial evaluation, a technique popular with compiler writers and implementors of programming
systems 13. In Fig. 3, when the user says 'Indiana' at the second step, the program is partially evaluated with respect
to this variable (and variables for other states set to zero); the simplified program continues to solicit for party, but one
of the choices is pruned out since it leads to a dead end. Notice that a given program, when used with an interpreter,
corresponds to a system-initiated dialog but morphs into a mixed-initiative dialog when used with a partial evaluator!

Initial program:

    if (Representative)
        if (Democrat)
            if (California)
                ...
        if (Independent)
            ...
        if (Republican)
            ...
    if (Senator)
        if (Democrat)
            if (Indiana)
                ...
        if (Independent)
            ...
        if (Republican)
            if (Minnesota)
                /* Norm Coleman */
            if (Indiana)
                /* Richard Lugar */

After 'Senator':

    if (Democrat)
        if (Indiana)
            ...
    if (Independent)
        ...
    if (Republican)
        if (Minnesota)
            /* Norm Coleman */
        if (Indiana)
            /* Richard Lugar */

After 'Republican':

    if (Minnesota)
        /* Norm Coleman */
    if (Indiana)
        /* Richard Lugar */

After 'Minnesota':

    /* Norm Coleman */

Figure 2: Staging a system-initiated dialog using program transformations. The user specifies ('Senator,'
'Republican,' 'Minnesota'), in that order.

Initial program:

    if (Representative)
        if (Democrat)
            if (California)
                ...
        if (Independent)
            ...
        if (Republican)
            ...
    if (Senator)
        if (Democrat)
            if (Indiana)
                ...
        if (Independent)
            ...
        if (Republican)
            if (Minnesota)
                /* Norm Coleman */
            if (Indiana)
                /* Richard Lugar */

After 'Senator':

    if (Democrat)
        if (Indiana)
            ...
    if (Independent)
        ...
    if (Republican)
        if (Minnesota)
            /* Norm Coleman */
        if (Indiana)
            /* Richard Lugar */

After 'Indiana':

    if (Democrat)
        ...
    if (Republican)
        /* Richard Lugar */

After 'Republican':

    /* Richard Lugar */

Figure 3: Staging a mixed-initiative dialog using program transformations. The user specifies ('Senator,' 'Indiana,'
'Republican'), in that order.
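
The difference between interpreting and partially evaluating such a program can be sketched in a few lines of
Python. This is only an illustration, not the framework's implementation: the nested dictionary `SITE`, the function
name `partial_evaluate`, the "..." leaves, and the "(another state)" label are all hypothetical stand-ins for the
program and elided branches of Figs. 2 and 3.

```python
# Hypothetical encoding of the nested conditionals: each key is a hyperlink
# label; each value is a sub-dialog (dict) or a leaf page (str).
# "..." marks branches that the figures elide.
SITE = {
    "Representative": {"Democrat": {"California": "..."}},
    "Senator": {
        "Democrat": {"Indiana": "..."},
        "Independent": {"(another state)": "..."},
        "Republican": {"Minnesota": "Norm Coleman", "Indiana": "Richard Lugar"},
    },
}


def partial_evaluate(tree, label):
    """Simplify `tree` assuming `label` is true and its siblings are false."""
    if isinstance(tree, str):
        return None                  # leaf reached without seeing `label`
    if label in tree:
        return tree[label]           # responsive input: plain interpretation
    # Out-of-turn input: specialize every branch, dropping dead ends.
    pruned = {}
    for key, subtree in tree.items():
        reduced = partial_evaluate(subtree, label)
        if reduced is not None:
            pruned[key] = reduced
    return pruned or None


senate = partial_evaluate(SITE, "Senator")        # click on 'Senator'
by_state = partial_evaluate(senate, "Indiana")    # say 'Indiana' out of turn
print(by_state)
print(partial_evaluate(by_state, "Republican"))
```

When the label is solicited at the current level the function behaves like an interpreter; otherwise it reaches into
the nested conditionals and removes the branches (here, 'Independent') that can no longer lead to a leaf.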

This is the essence of the staging transformation framework: using a program to model the structure of the dialog
and specifying a program transformer to stage it. We write the first dialog as:

\[ \frac{I}{\text{branch} \quad \text{party} \quad \text{state}} \]

where the I denotes an interpreter. Similarly, the second dialog is represented as:

\[ \frac{PE}{\text{branch} \quad \text{party} \quad \text{state}} \]

where the PE denotes a partial evaluator. An interpreter permits only inputs that are responsive to the current
solicitation and proceeds in a strict sequential order; it results in the most restrictive dialog. A PE, on the other hand,
allows utterances of any combination of available input slots in the dialog. It is the most flexible of stagers.

We will introduce a third stager, called a curryer (C), that permits utterance of only valid prefixes of the input
arguments. The dialog represented by

\[ \frac{C}{\text{branch} \quad \text{party} \quad \text{state}} \]

allows utterance of either 'branch,' or ('branch,' 'party'), or ('branch,' 'party,' 'state'). In other words, if we are going
to take the initiative at a certain point, we must also answer the currently posed question.
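
To make the three stagers concrete, the following sketch lists which slot combinations may be spoken in the next
utterance of a flat dialog. The function name `legal_utterances` and the string codes "I", "C", and "PE" are
illustrative assumptions; the framework's own rules are those of the staging notation, not this code.

```python
from itertools import combinations


def legal_utterances(stager, slots):
    """Slot combinations that may be spoken next, for a flat dialog."""
    if not slots:
        return []
    if stager == "I":     # interpreter: only the currently solicited slot
        return [tuple(slots[:1])]
    if stager == "C":     # curryer: any non-empty prefix of the remaining slots
        return [tuple(slots[:k]) for k in range(1, len(slots) + 1)]
    if stager == "PE":    # partial evaluator: any non-empty combination
        return [combo
                for k in range(1, len(slots) + 1)
                for combo in combinations(slots, k)]
    raise ValueError(f"unknown stager: {stager}")


SLOTS = ["branch", "party", "state"]
for stager in ("I", "C", "PE"):
    print(stager, legal_utterances(stager, SLOTS))
```

For three slots this yields one legal combination under I, three under C, and seven under PE, which mirrors the
ordering from most restrictive to most flexible described above.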

These stagers can be composed in a hierarchical fashion to yield dialogs composed of smaller dialogs, or subdialogs.
This allows us to make fine-grained distinctions about the structure of dialogs and the range of valid inputs. In
this sense,

\[ \frac{PE}{a \quad b \quad c \quad d}
   \quad \text{is not the same as} \quad
   \frac{PE}{\dfrac{PE}{a \quad b} \quad \dfrac{PE}{c \quad d}} \]

The former allows all 4! permutations of {a, b, c, d} whereas the latter precludes utterances such as
$\prec c\; a\; b\; d \succ$.

As a practical example of our dialog representation, consider a breakfast dialog involving specification of an {eggs,
coffee, bakery item} tuple. The user can specify these items in any order, but each item involves a second clarification
aspect. After the user has specified his choice of eggs, a clarification of 'how do you like your eggs?' might be needed.
Similarly, when the user is talking about coffee, a clarification of 'do you take cream and sugar?' might be required,
and so on. This form of mixing initiative is known as subdialog invocation [2]. The set of interaction sequences that
address this requirement can be represented as:

\[ \frac{PE}{\dfrac{C}{e_1 \quad e_2} \quad \dfrac{C}{c_1 \quad c_2} \quad \dfrac{C}{b_1 \quad b_2}} \]

where $e_1$, $e_2$ are egg specification aspects, $c_1$, $c_2$ support coffee specification, and $b_1$, $b_2$ specify a bakery item.
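
The effect of composing stagers can be checked by brute-force enumeration. The sketch below assumes a hypothetical
encoding in which a dialog is either a slot name or a `(stager, parts)` pair, and it enumerates only the order in
which slots get filled (not how fillings are grouped into utterances); `fill_orders` is an illustrative name, not part
of the framework.

```python
from itertools import permutations, product


def fill_orders(dialog):
    """Enumerate the slot orders accepted by a (possibly nested) dialog.

    A dialog is a slot name (str) or a pair (stager, [sub-dialogs]).
    Sub-dialogs are never interleaved: one must finish before the next starts.
    """
    if isinstance(dialog, str):
        return [(dialog,)]
    stager, parts = dialog
    # I and C keep their parts in the written order; PE allows any order.
    arrangements = permutations(parts) if stager == "PE" else [tuple(parts)]
    orders = []
    for arrangement in arrangements:
        for choice in product(*(fill_orders(part) for part in arrangement)):
            orders.append(tuple(slot for seq in choice for slot in seq))
    return orders


flat = ("PE", ["a", "b", "c", "d"])
nested = ("PE", [("PE", ["a", "b"]), ("PE", ["c", "d"])])
breakfast = ("PE", [("C", ["e1", "e2"]), ("C", ["c1", "c2"]), ("C", ["b1", "b2"])])

print(len(fill_orders(flat)), len(fill_orders(nested)), len(fill_orders(breakfast)))
print(("c", "a", "b", "d") in fill_orders(nested))
```

Under these assumptions the flat dialog admits all 24 orders, the nested one admits only 8 and excludes the order
c, a, b, d, and the breakfast dialog admits the 6 orderings of its three curried subdialogs.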

<?xml version="1.0" encoding="UTF-8"?>
<dialog-spec>
  <dialog id="top" stager="pe" next="none" type="leaf">
    <dialog-item name="house"/>
    <dialog-item name="party"/>
    <dialog-item name="state"/>
    <dialog-item name="seat"/>
    <dialog-item name="district"/>
  </dialog>
</dialog-spec>

Figure 4: DialogXML for the U.S. Congresspeople site. This dialog is staged by a partial evaluator (PE) and consists
of five specification aspects.

The staging transformations framework also specifies a set of rules that dictate how a (dialog, stager) pair is
to be simplified based on user input. This is not as straightforward as it looks, as it might require a
global restructuring of the representation. Assume that we stage the breakfast dialog using the interaction sequence
$\prec c_1\; e_1\; c_2 \cdots \succ$. The occurrence of $e_1$ is invalid according to the dialog specification above, but we will
not know that such an input is arriving at the time we are processing $c_1$. So in response to the input $c_1$, the dialog
must be restructured as follows:

\[ \frac{PE}{\dfrac{C}{e_1 \quad e_2} \quad \dfrac{C}{c_1 \quad c_2} \quad \dfrac{C}{b_1 \quad b_2}}
   \;\longrightarrow\;
   \frac{C}{c_2 \quad \dfrac{PE}{\dfrac{C}{e_1 \quad e_2} \quad \dfrac{C}{b_1 \quad b_2}}} \]

By replacing the top-level PE stager with a C, it becomes clear that the only legal input now possible must include $c_2$.
Once the coffee subdialog is completed, the top-level stager will revert back to a PE. Such dialog restructurings are
necessary if we are to remain faithful to the original specification. See [13] for formal algorithms to perform such
dialog restructurings.
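
The restructuring of the breakfast dialog can be illustrated with a deliberately narrow sketch. It handles only the
shape used in this example (a PE over curried subdialogs) and is not the general algorithm referred to above; the
tuple encoding and the name `restructure` are assumptions made for illustration.

```python
def restructure(top, slot):
    """Rewrite a PE-over-curried-subdialogs dialog after `slot` is heard.

    `top` is ("PE", [("C", [slots...]), ...]). If `slot` opens one of the
    curried subdialogs, the rest of that subdialog must come next, so the
    result is a C whose last element is a PE over the other subdialogs.
    """
    stager, subdialogs = top
    if stager != "PE":
        raise ValueError("this sketch only handles a top-level PE")
    for i, (sub_stager, slots) in enumerate(subdialogs):
        if sub_stager == "C" and slots and slots[0] == slot:
            others = subdialogs[:i] + subdialogs[i + 1:]
            return ("C", slots[1:] + [("PE", others)])
    raise ValueError(f"{slot!r} is not a legal input here")


breakfast = ("PE", [("C", ["e1", "e2"]), ("C", ["c1", "c2"]), ("C", ["b1", "b2"])])
print(restructure(breakfast, "c1"))
```

After `c1` is heard, the only way forward in the returned structure is through `c2`, which is what lets a later `e1`
be recognized as out of place.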

At this point, it should be clear that the staging transformation framework is a powerful representational basis to
design dialogs: it has a uniform vocabulary for denoting specification aspects (e.g., each of the slots above could be
filled via hyperlink clicks or by voice) and the use of stagers helps us control the mixing of initiative in a very precise
manner.

In order to make the staging notation machine-readable, we have defined an XML representation of dialogs called
DialogXML. Fig. 4 depicts a minimal DialogXML specification for the politicians example. DialogXML provides
elements for defining the slots associated with a dialog, the textual prompts associated with each slot (not shown in
Fig. 4), the accompanying vocal prompts and any tapering of them over the course of interaction (also not shown),
confirmatory characteristics (whether the user's response needs confirmation), and constructs for combining basic
dialog elements to create complex dialogs. More details about DialogXML and the possible legal specifications are
available in [13]. While DialogXML borrows ideas from some tags in the VoiceXML standard [6], the structure of
the DialogXML document is more closely modeled after the idea of stagers.¹
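
As a rough illustration of how such a specification could be read into an internal form, the sketch below parses
the five-slot example with Python's standard `xml.etree.ElementTree`. Only the elements and attributes visible in
the figure are used; the function name `load_dialogs` and the returned dictionary layout are assumptions, and the
actual framework is implemented in Java.

```python
import xml.etree.ElementTree as ET

DIALOG_XML = """<?xml version="1.0" encoding="UTF-8"?>
<dialog-spec>
  <dialog id="top" stager="pe" next="none" type="leaf">
    <dialog-item name="house"/>
    <dialog-item name="party"/>
    <dialog-item name="state"/>
    <dialog-item name="seat"/>
    <dialog-item name="district"/>
  </dialog>
</dialog-spec>"""


def load_dialogs(xml_text):
    """Return {dialog id: (stager, [slot names])} for each <dialog> element."""
    # Parse from bytes so the encoding declaration is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    dialogs = {}
    for dialog in root.iter("dialog"):
        slots = [item.get("name").strip()
                 for item in dialog.findall("dialog-item")]
        dialogs[dialog.get("id")] = (dialog.get("stager"), slots)
    return dialogs


print(load_dialogs(DIALOG_XML))
```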

2.2 Site Creation and Content Generation 

Having such a representation of the dialog is only the first step. We must create a site to reflect the underlying structure 
of the dialog and initiate speech processing to recognize the legal utterances as specified in the stager markup. Based 
on user input, we must simplify the dialog and present personalized content, including facilities for continuing the 
dialog. 

An integrated software framework for this purpose is shown in Fig. 5. Operationally it can be divided into
four modules: (i) seeding dialog representations, (ii) staging dialogs, (iii) input and output processing, and (iv)
database connectivity. The idea of seeding dialog representations is to take a dialog specified in our DialogXML
notation and create an internal representation, suitable for staging. The staging transformation framework is then
used to handle dialog management. The embodiment of these dialogs must contend with voice realizations, hence a
significant portion of the framework is devoted to generating grammars for speech output and validating voice input.
The framework currently uses the SALT and Speech Recognition Grammar Specification (SRGS) standards to handle
voice interaction. Finally, the database operations module manages and streamlines the delivery of web content. Every
website is organized as a database that the user initially selects for exploration via out-of-turn interaction. Each record
in the database identifies a unique interaction sequence leading to a leaf web page (e.g., Senator Norm Coleman is
identified by a record that describes his political affiliations and other addressable attributes).

Figure 5: Multimodal out-of-turn interaction framework architecture.

¹ It has recently been brought to our attention that there is a similar technology with the same name [8]. That work,
however, is an extension of the VoiceXML standard for voice interfaces and does not address multimodal interaction.

When the interaction begins, the dialog manager uses the parsed DialogXML and metadata from the database to 
initialize the representation. The dialog manager must then decide what content to display on the page, the items to 
offer the user for selection, and the speech prompts to play for the current slot. In addition, the dialog manager must 
determine what aspects the user may specify out-of-turn. Through an analysis of the dialog representation and a set 
of SQL queries, it determines this information and generates an HTML page that contains relevant SALT XML objects
and references to a suitably generated SRGS grammar. The grammar identifies all the legally speakable utterances 
for the particular page. 

User input from both voice and hyperlinks is uniformly handled by the Utterance Validator module of the system. 
Upon receiving an utterance from the user, the module first determines whether the utterance contains fillings for 
multiple slots, and whether the utterance is valid. If part or all of the utterance is invalid, the system
accepts the valid fillings and rejects the invalid ones, and an appropriate prompt is displayed and played to the
user. Having tokenized the user's utterance into its constituent fillings, the dialog manager then calls the staging
transformer with the values for the fillings in the order they were received. After the representation is simplified, the
dialog manager applies a suite of dialog motivators [1] (discussed below) to the dialog. If the dialog is not completed,
content creation and speech grammar generation resume.
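
The accept-and-reject behaviour described above might look as follows in outline. This is a minimal sketch under
stated assumptions: an utterance is taken to be already tokenized into (slot, value) pairs, legality is reduced to
membership in a per-slot set of remaining values, and the stager rules are not checked. The name `validate` and its
parameters are hypothetical.

```python
def validate(fillings, legal_values, filled):
    """Split an utterance's slot fillings into accepted and rejected parts.

    fillings:     [(slot, value), ...] in the order they were received
    legal_values: {slot: set of values still reachable in the dialog}
    filled:       set of slots that were already specified
    """
    accepted, rejected = [], []
    seen = set(filled)
    for slot, value in fillings:
        ok = slot not in seen and value in legal_values.get(slot, ())
        if ok:
            accepted.append((slot, value))
            seen.add(slot)          # a slot may be filled only once
        else:
            rejected.append((slot, value))
    return accepted, rejected


LEGAL = {"party": {"Democrat", "Republican"}, "state": {"Indiana", "Minnesota"}}
utterance = [("state", "Indiana"), ("party", "Independent")]
print(validate(utterance, LEGAL, filled={"branch"}))
```

The accepted fillings would then be handed to the staging transformer in order, while the rejected ones would
trigger a prompt to the user.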

The system is implemented as a Java web application using JSPs and Servlets. The application runs inside the 
Tomcat servlet engine. The system uses the Apache HTTPd web-server with a WARP connector to connect to the 
Tomcat servlet engine, and functions like a proxy-server. A PostgreSQL database server serves the example databases 
we use in this article. The web-application connects to the database server using the Java Database Connectivity 
(JDBC) API. It uses a meta-data API to learn about the structure of the data present in the database. An SQL query 
initially helps compute a VIEW that serves as the starting point for the dialog. This VIEW is used for all future
interactions and helps reinforce that a user is always working with a personalized 'view' of the information space.
VIEWs can also increase system efficiency, as they can be shared across many users. We tested the
system using the Microsoft SALT plug-in for the Internet Explorer 6.0 browser on a system running Windows XP. 
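
The role of the VIEW can be illustrated with a small, self-contained sketch. It uses Python's built-in `sqlite3`
module as a stand-in for the PostgreSQL/JDBC setup described above; the table name `officials`, the view name
`dialog_view`, the column names, and the helper `remaining_values` are all hypothetical, and the two rows are the
only officials named in this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE officials (name TEXT, branch TEXT, party TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO officials VALUES (?, ?, ?, ?)",
    [("Norm Coleman", "Senator", "Republican", "Minnesota"),
     ("Richard Lugar", "Senator", "Republican", "Indiana")],
)
# The VIEW is the starting point of the dialog (here: after choosing 'Senator').
conn.execute("CREATE VIEW dialog_view AS SELECT * FROM officials WHERE branch = 'Senator'")

SLOTS = ("branch", "party", "state")


def remaining_values(conn, slot, filled):
    """Distinct values of `slot` still reachable given the slots filled so far."""
    if slot not in SLOTS or any(s not in SLOTS for s in filled):
        raise ValueError("unknown slot")   # column names are never user-supplied
    where = " AND ".join(f"{s} = ?" for s in filled) or "1 = 1"
    sql = f"SELECT DISTINCT {slot} FROM dialog_view WHERE {where} ORDER BY {slot}"
    return [row[0] for row in conn.execute(sql, tuple(filled.values()))]


print(remaining_values(conn, "state", {}))
print(remaining_values(conn, "party", {"state": "Indiana"}))
```

Queries of this kind are one plausible way to decide which hyperlinks to display and which utterances remain legal
after each input.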

2.3 Dialog Motivators 

The only aspects of the architecture that remain to be covered are the dialog motivators and the grammar generators. Dialog
motivators are useful nuggets of processing that help streamline the dialog at every user utterance. We use four main 
motivators: 

1. complete-dialog: This motivator decides if a dialog is complete. A dialog is complete if a unique record in 
the database VIEW being used has been identified, or if there are no more items left to solicit input for in the 
dialog. In such cases, the unnecessary slots are removed from the representation.

2. prune-dialog: This motivator decides if the internal dialog representation can be pruned as a result of the 
previous utterance by the user. For example, in the case of a pizza dialog, if the user specifies a size of 'small,' 
and only pepperoni pizzas are available in the small size, there is no longer a need to ask the user for a topping. Thus
the topping slot can be automatically filled with 'pepperoni' and removed from the dialog representation. While 
the current implementation does not provide the user with notification when a dialog is pruned in this fashion, 
such user-feedback is being considered for future work. 

Notice that complete-dialog is a specialization of prune-dialog. 

3. confirm-dialog: This motivator applies if the item has been designated in the DialogXML as one for which 
confirmation must be sought. In a real application, confirmation would be sought for utterances that have been 
recognized with a low confidence value; however, SALT does not provide us with the hooks to learn about
confidence values of recognized utterances, hence we specify the need for confirmation in the DialogXML 
markup. 

4. collect-results: This motivator applies if the user explicitly requested (via a 'Show me results' utterance) that 
the dialog be terminated in order to view a flat listing (of the relevant remaining records). 
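
The first two motivators can be sketched over an in-memory list of records. This is an illustration only: the
function name `apply_motivators`, the record layout, and the extra pizza rows (including the 'mushroom' topping)
are assumptions added to make the pizza example above runnable.

```python
def apply_motivators(records, filled, slots):
    """Apply prune-dialog, then test complete-dialog.

    records: list of dicts, one per row of the current VIEW
    filled:  {slot: value} specified so far
    slots:   slot names in the order they would be solicited
    Returns (filled, remaining slots, is_complete).
    """
    matches = [r for r in records
               if all(r[s] == v for s, v in filled.items())]
    filled = dict(filled)
    for slot in slots:
        values = {r[slot] for r in matches}
        # prune-dialog: a slot with a single possible value need not be asked.
        if slot not in filled and len(values) == 1:
            filled[slot] = values.pop()
    remaining = [s for s in slots if s not in filled]
    # complete-dialog: a unique record is identified, or nothing is left to ask.
    complete = len(matches) == 1 or not remaining
    return filled, remaining, complete


PIZZAS = [
    {"size": "small", "topping": "pepperoni"},
    {"size": "large", "topping": "pepperoni"},
    {"size": "large", "topping": "mushroom"},   # hypothetical extra row
]
print(apply_motivators(PIZZAS, {"size": "small"}, ["size", "topping"]))
print(apply_motivators(PIZZAS, {"size": "large"}, ["size", "topping"]))
```

With 'small' specified, the topping slot is filled automatically and the dialog completes; with 'large', the
topping still has to be solicited.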

2.4 Grammar Generation 

Grammar generation proceeds in a straightforward manner except for a careful division of labor between the browser, 
utterance validator, and the embedded speech grammars themselves. For instance, the system generates JavaScript
to handle some types of interaction on the client side within the browser itself. Confirmation of the user's
utterance, 'What may I say?,' and 'Show me something else' type of questions are examples of interactions handled by
JavaScript. The embedded SRGS grammars are used for encapsulating site-specific logic and are faithful to the C and
I stager specifications. Generating grammar fragments corresponding to the PE stager would result in an exponential
enumeration of utterance possibilities, so we use a less restrictive grammar and allow the invalid utterances to be
caught by the Utterance Validator instead.
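
A grammar of the less restrictive kind might be generated along the following lines. The sketch emits a minimal
grammar in the XML form of SRGS; the function name `srgs_grammar`, the `loose` flag, and the choice of a single
public rule are illustrative assumptions rather than a description of the framework's actual generator.

```python
from xml.sax.saxutils import escape


def srgs_grammar(rule_id, phrases, loose=False):
    """Build a minimal SRGS (XML form) grammar over `phrases`.

    With loose=False exactly one phrase is accepted. With loose=True one or
    more phrases are accepted in any order, and checking which combinations
    are actually legal is left to a later validation step.
    """
    items = "".join(f"<item>{escape(p)}</item>" for p in phrases)
    body = f"<one-of>{items}</one-of>"
    if loose:
        body = f'<item repeat="1-">{body}</item>'
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar" '
        f'xml:lang="en-US" mode="voice" root="{rule_id}">'
        f'<rule id="{rule_id}" scope="public">{body}</rule>'
        "</grammar>"
    )


strict = srgs_grammar("party", ["Democrat", "Republican"])
loose = srgs_grammar("any", ["Senator", "Republican", "Indiana"], loose=True)
print(strict)
print(loose)
```

The loose form grows linearly with the number of phrases instead of enumerating every legal combination, at the
cost of admitting utterances that must then be rejected downstream.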

3 Example Applications 

Two applications have been created using the out-of-turn interaction framework presented above. The first is an 
interface to the Project VoteSmart website (http://www.vote-smart.org); an example interaction has already been
described in Fig. 1. This interface provides details on about 540 politicians comprising the U.S. Congress.

Figure 6: An interaction sequence using the multimodal web interface to the Fuel Economy Guide. (a) An out-of-turn
utterance. (b) The dialog is complete.



The second application is an interface to the fuel economy guide at the Environmental Protection Agency (EPA,
http://www.fueleconomy.gov). The EPA provides raw data on fuel economy statistics about cars available in the
United States in comma-separated values (CSV) format. For this article, we downloaded and reformatted data from the
past three years (2000, 2001, and 2002) and loaded it into a PostgreSQL database. The dataset has a total of 2641
records, which translates to approximately 880 records per year. Up to 26 different attributes can be specified in any
interaction with this database. We organized a dialog around three subdialogs. The first is an engine subdialog, which
solicits the fuel type, whether the engine is a gas guzzler, and whether it is equipped with a turbocharger and/or
a supercharger. Another subdialog solicits information about the transmission (whether it is automatic or manual)
and the drive (4 wheel or all wheel). The main specification aspects for the car (year, manufacturer, model, number)
are included in another subdialog. While there are other ways to organize this information, we initialized the dialog
representation as:

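\[ \frac{PE}{\dfrac{PE}{\text{year} \quad \text{class} \quad \text{maker} \quad \text{model}}
   \quad \dfrac{\cdots}{\text{fuel} \quad \text{gas} \quad \text{super charged?} \quad \text{turbo charged?}}
   \quad \dfrac{\cdots}{\text{transmission} \quad \text{drive}}} \]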

An example interaction is shown in Fig. 6. It depicts an expert user who knows exactly what he wants, and as a result
does not need to engage in a dialog with the system. In a single utterance, he specifies three pieces of information
that uniquely identify a car in the database. The dialog manager has applied the prune-dialog motivator to the items
in the remaining dialog as these specification aspects are no longer necessary. The system redirects the user to a
leaf page, where the user is able to see information about Ford Escort cars manufactured in 2000. This example also
shows how the user is able to fill multiple slots in a single utterance while interacting with the system.

We hasten to add that, for ease of presentation, the screenshots in this article were taken on a Microsoft PocketPC
simulator. As of this writing, PocketPC does not support SALT and these case studies were actually tested using
Microsoft Internet Explorer with the SALT plug-in.

4 Discussion 

The above applications have highlighted the key features of our software framework. The staging transformers have 
been primarily responsible for dialog control, specifying what the user may say at any given time. The dialog manager 
has streamlined the dialog, pruning it when necessary, and triggering the appropriate actions. Our modularized 
implementation approach makes it easy to construct speech-enabled interfaces to database-driven sites. For want of 
space, we have not demonstrated several other features of our system such as user response confirmation, tapered 
prompting, and results collection. 

This work helps demonstrate the viability of our view of the speech-enabled web - namely, that of a flexible 
dialog between the user and the system which allows the user to take the initiative in controlling the flow of the 
dialog. Rosenfeld et al. [12] have argued that speech interfaces will become increasingly ubiquitous and will be able
to support smaller form-factors without compromising usability. The applications presented here validate this viewpoint
and help illustrate the importance of using voice to supplement interaction in mobile devices. 

It is helpful to contrast our representation-based approach with other ways of specifying dialogs, notably VoiceXML. 
While they share some similarities, our DialogXML notation is purely declarative and captures only the structure of 
the dialog. Control is implicitly specified using program transformations, which makes the process of dialog speci- 
fication less cumbersome for the designer. Furthermore, while VoiceXML permits mixed-initiative dialog sequences 
too, it does so more as a result of how its form interpretation algorithm (FIA) is organized. Using a combination of 
program transformation constructs and hierarchically composed dialogs, we are able to specify the nature of
out-of-turn interaction in a manner not precisely expressible in VoiceXML (see [11] for more details).

The successful implementation of a dialog-based system [14, 11] requires many more facets such as language
understanding, task modeling, intention recognition, and plan management, which are beyond the scope of this work. 
We are now exploring several directions such as natural language speech input and extending the specification capa- 
bility of DialogXML. We are also conducting usability studies for our multimodal interfaces and carefully assessing 
the role of speech as an out-of-turn interaction medium. Especially important is addressing the veritable 'how do 
users know what to say?' problem [16] for multimodal web browsing.

This work is an initial exploration into the use of multimodal interfaces to websites. As the use of browsers 
that support technologies such as SALT and X+V grows, the importance of software frameworks to support multimodal
interaction will only increase. 